## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X Yosemite 10.10.5
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] backports_1.0.4 magrittr_1.5 rprojroot_1.1 tools_3.3.2
## [5] htmltools_0.3.5 yaml_2.1.14 Rcpp_0.12.10 stringi_1.1.2
## [9] rmarkdown_1.3 knitr_1.15.1 stringr_1.1.0 digest_0.6.11
## [13] evaluate_0.10
This notebook documents educational attainment levels and age-adjusted death rates (AADR) for different causes of death across the United States. The data are organized to explore the relationship between educational attainment and AADR between 1999 and 2013.
How to download the dataset from Data.World:
How to set up for ETL:
The ETL script joins the Pre_ETL_Death data frame with the education data frame from the US Census by AreaName (the state name).
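The join described above can be sketched in base R as follows. This is only an illustration under assumptions: the data frames `deaths` and `education` and all of their values are placeholders, not the notebook's actual inputs, and the real ETL script may use a different join function (for example `dplyr::inner_join`).

```r
# Toy stand-ins for the two inputs; all values are placeholders.
deaths <- data.frame(
  AreaName = c("Alabama", "Alabama", "Alaska"),
  year     = c(1999L, 1999L, 1999L),
  cause    = c("All Causes", "Homicide", "All Causes"),
  stringsAsFactors = FALSE
)
education <- data.frame(
  AreaName  = c("Alabama", "Alaska"),
  edu.total = c(1000L, 500L),
  stringsAsFactors = FALSE
)

# Inner join on the state name: every death row picks up the education
# columns of its state, so the education values repeat across rows.
joined <- merge(deaths, education, by = "AreaName")
print(joined)
```

Because education is recorded once per state while deaths are recorded per state, year, and cause, the education columns repeat on every row of a state after the join.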
## Loading required package: readr
## Loading required package: plyr
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Loading required package: data.world
##
## Attaching package: 'data.world'
## The following object is masked from 'package:dplyr':
##
## query
## Loading required package: DT
## Parsed with column specification:
## cols(
## YEAR = col_integer(),
## `113_CAUSE_NAME` = col_character(),
## CAUSE_NAME = col_character(),
## STATE = col_character(),
## DEATHS = col_character(),
## AADR = col_character()
## )
## Classes 'tbl_df', 'tbl' and 'data.frame': 13261 obs. of 19 variables:
## $ State : chr "AL" "AL" "AL" "AL" ...
## $ AreaName : chr "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ edu.total : int 3239351 3239351 3239351 3239351 3239351 3239351 3239351 3239351 3239351 3239351 ...
## $ edu.males : int 1532613 1532613 1532613 1532613 1532613 1532613 1532613 1532613 1532613 1532613 ...
## $ edu.females: int 1706738 1706738 1706738 1706738 1706738 1706738 1706738 1706738 1706738 1706738 ...
## $ m_no_school: int 21478 21478 21478 21478 21478 21478 21478 21478 21478 21478 ...
## $ m_hs : int 490120 490120 490120 490120 490120 490120 490120 490120 490120 490120 ...
## $ m_bs : int 226204 226204 226204 226204 226204 226204 226204 226204 226204 226204 ...
## $ m_ms : int 82686 82686 82686 82686 82686 82686 82686 82686 82686 82686 ...
## $ m_phd : int 19299 19299 19299 19299 19299 19299 19299 19299 19299 19299 ...
## $ f_no_school: int 20398 20398 20398 20398 20398 20398 20398 20398 20398 20398 ...
## $ f_hs : int 515175 515175 515175 515175 515175 515175 515175 515175 515175 515175 ...
## $ f_bs : int 252608 252608 252608 252608 252608 252608 252608 252608 252608 252608 ...
## $ f_ms : int 119311 119311 119311 119311 119311 119311 119311 119311 119311 119311 ...
## $ f_phd : int 13033 13033 13033 13033 13033 13033 13033 13033 13033 13033 ...
## $ year : int 1999 1999 1999 1999 1999 1999 1999 1999 1999 1999 ...
## $ cause : chr "Unintentional Injuries" "All Causes" "Alzheimer's disease" "Homicide" ...
## $ amt_death : chr "2313" "44806" "772" "438" ...
## $ AADR : chr "52.17" "1009.30" "17.80" "9.87" ...
## Warning in `[<-.factor`(`*tmp*`, is.na(x), value = ""): invalid factor
## level, NA generated
## Warning in `[<-.factor`(`*tmp*`, is.na(x), value = ""): invalid factor
## level, NA generated
## Warning in `[<-.factor`(`*tmp*`, is.na(x), value = ""): invalid factor
## level, NA generated
## [1] "edu.total"
## Warning in `[<-.factor`(`*tmp*`, is.na(x), value = 0): invalid factor
## level, NA generated
## [1] "edu.males"
## Warning in `[<-.factor`(`*tmp*`, is.na(x), value = 0): invalid factor
## level, NA generated
## [1] "edu.females"
## Warning in `[<-.factor`(`*tmp*`, is.na(x), value = 0): invalid factor
## level, NA generated
## [1] "m_no_school"
## Warning in `[<-.factor`(`*tmp*`, is.na(x), value = 0): invalid factor
## level, NA generated
## [1] "m_hs"
## Warning in `[<-.factor`(`*tmp*`, is.na(x), value = 0): invalid factor
## level, NA generated
## [1] "m_bs"
## Warning in `[<-.factor`(`*tmp*`, is.na(x), value = 0): invalid factor
## level, NA generated
## [1] "m_ms"
## Warning in `[<-.factor`(`*tmp*`, is.na(x), value = 0): invalid factor
## level, NA generated
## [1] "m_phd"
## Warning in `[<-.factor`(`*tmp*`, is.na(x), value = 0): invalid factor
## level, NA generated
## [1] "f_no_school"
## Warning in `[<-.factor`(`*tmp*`, is.na(x), value = 0): invalid factor
## level, NA generated
## [1] "f_hs"
## Warning in `[<-.factor`(`*tmp*`, is.na(x), value = 0): invalid factor
## level, NA generated
## [1] "f_bs"
## Warning in `[<-.factor`(`*tmp*`, is.na(x), value = 0): invalid factor
## level, NA generated
## [1] "f_ms"
## Warning in `[<-.factor`(`*tmp*`, is.na(x), value = 0): invalid factor
## level, NA generated
## [1] "f_phd"
## Warning in `[<-.factor`(`*tmp*`, is.na(x), value = 0): invalid factor
## level, NA generated
## [1] "year"
## Warning in `[<-.factor`(`*tmp*`, is.na(x), value = 0): invalid factor
## level, NA generated
## [1] "amt_death"
## Warning in `[<-.factor`(`*tmp*`, is.na(x), value = 0): invalid factor
## level, NA generated
## [1] "AADR"
## Warning in `[<-.factor`(`*tmp*`, is.na(x), value = 0): invalid factor
## level, NA generated
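The repeated warning above is what R emits when a value that is not an existing level is assigned into a factor column; the assignment produces NA instead of the intended fill value. One way to avoid it, sketched here under assumptions, is to convert the column to numeric before filling missing values. The helper `to_numeric_fill` and the toy data frame are hypothetical and not part of the original ETL script.

```r
# Toy data frame whose numeric-looking columns ended up as factors.
df <- data.frame(
  amt_death = c("2313", "44806", NA),
  AADR      = c("52.17", NA, "17.80"),
  stringsAsFactors = TRUE
)

# Hypothetical helper: convert to numeric first, then fill missing values.
to_numeric_fill <- function(x, fill = 0) {
  # as.character() first: as.numeric() on a factor returns level codes.
  values <- as.numeric(as.character(x))
  values[is.na(values)] <- fill
  values
}

numeric_cols <- c("amt_death", "AADR")
df[numeric_cols] <- lapply(df[numeric_cols], to_numeric_fill)
str(df)
```

Whether zero is a sensible replacement for a missing death count or rate is a separate judgment call; the sketch only mirrors the `value = 0` seen in the warnings.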
This is a boxplot for each cause of death. The x-axis lists every cause of death, and the y-axis shows the death counts for that cause for every state and year. Heart Disease and Cancer show the highest death counts and also the greatest variability. Boxplots for the other causes of death show less variability and lower counts.
This is a histogram of AADR by state. The x-axis shows AADR bins running from 580 to 980 in increments of 20. The y-axis shows the number of states that fall into each bin. The distribution looks approximately normal with a slight right skew. Additionally, there is a small cluster in the 880-960 AADR range. Most states fall into the 700-800 AADR range.
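The binning described above (580 to 980 in steps of 20) can be reproduced with `hist()` in base R. The AADR values below are placeholders chosen for illustration, not values from the dataset.

```r
# Placeholder AADR values, one per state; not taken from the dataset.
aadr_by_state <- c(585, 640, 702, 715, 733, 748, 760, 779, 795,
                   812, 890, 905, 955)

# 21 bin edges from 580 to 980 give 20 bins of width 20.
bin_edges <- seq(580, 980, by = 20)
h <- hist(aadr_by_state, breaks = bin_edges, plot = FALSE)

# Number of states in the 700-800 range, read off the bin midpoints.
in_700_800 <- sum(h$counts[h$mids > 700 & h$mids < 800])
print(in_700_800)
```

Setting `plot = FALSE` returns the bin counts without drawing; dropping that argument draws the histogram itself.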
The scatterplot displays aggregate AADR vs. the fraction of BS attainment for each state. Note the negative correlation between AADR and BS attainment.
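A minimal sketch of how such a correlation could be checked numerically. It assumes the BS fraction is `(m_bs + f_bs) / edu.total`, which the notebook does not state explicitly, and it uses placeholder values for four hypothetical states.

```r
# Placeholder per-state values; the state names are hypothetical.
states <- data.frame(
  AreaName  = c("A", "B", "C", "D"),
  m_bs      = c(60, 80, 100, 120),
  f_bs      = c(60, 80, 100, 120),
  edu.total = c(1000, 1000, 1000, 1000),
  AADR      = c(950, 850, 800, 700)
)

# Assumed definition of the BS attainment fraction.
states$bs_fraction <- (states$m_bs + states$f_bs) / states$edu.total

# Pearson correlation; a negative value matches the trend in the plot.
r <- cor(states$bs_fraction, states$AADR)
print(r)
```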
The barchart displays the sum of the AADR per cause from 1999-2013. The average sum of AADR across all causes does not vary greatly from year to year. Heart Disease, Cancer, and Stroke have consistently higher sums of AADR than the average for all causes.
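The aggregation behind a chart like this can be sketched with `aggregate()`: sum AADR over states for each year and cause, then average those sums per year. The rows below are placeholders, and the exact aggregation used in the original chart is an assumption.

```r
# Placeholder rows: 2 states x 2 causes x 2 years.
deaths <- expand.grid(
  year  = c(1999L, 2000L),
  cause = c("Heart Disease", "Homicide"),
  state = c("A", "B"),
  stringsAsFactors = FALSE
)
deaths$AADR <- c(260, 250, 10, 9, 240, 230, 8, 7)

# Sum of AADR over states for each year and cause (the bar heights).
sums <- aggregate(AADR ~ year + cause, data = deaths, FUN = sum)

# Average of those sums across causes for each year (the reference line).
yearly_mean <- aggregate(AADR ~ year, data = sums, FUN = mean)
print(sums)
print(yearly_mean)
```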
The High Level barchart displays females who have attained a Bachelor’s degree. The page shows only the states with high levels of female BS attainment (0.15-0.25). More states reached high levels for females than for males.
The Low Level barchart displays females who have attained a Bachelor’s degree. The page shows only the states with low levels of female BS attainment (<0.15).
The High Level barchart displays males who have attained a Bachelor’s degree. The page shows only the states with high levels of male BS attainment (0.15-0.25).
The Low Level barchart displays males who have attained a Bachelor’s degree. The page shows only the states with low levels of male BS attainment (<0.15).
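The high/low split used in the four barcharts above can be sketched with `cut()`. The Alabama counts come from the `str()` output earlier in this notebook; "Example State" and its values are hypothetical, and dividing `f_bs` by `edu.females` is an assumed definition of the attainment fraction.

```r
edu <- data.frame(
  AreaName    = c("Alabama", "Example State"),
  f_bs        = c(252608, 200),
  edu.females = c(1706738, 1000)
)

# Assumed definition: share of females with a Bachelor's degree.
edu$f_bs_fraction <- edu$f_bs / edu$edu.females

# Low: [0, 0.15); high: [0.15, 0.25]. Values above 0.25 become NA.
edu$f_bs_level <- cut(
  edu$f_bs_fraction,
  breaks = c(0, 0.15, 0.25),
  labels = c("low", "high"),
  right = FALSE,
  include.lowest = TRUE
)
print(edu[, c("AreaName", "f_bs_fraction", "f_bs_level")])
```

The same pattern would apply to males by swapping in `m_bs` and `edu.males`.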
US map of states filled by Bachelor’s degree attainment (top) and AADR (bottom). Note that the regions of low degree attainment coincide with the regions of high AADR.
This visualization uses Tableau Data Blending and Set IDs. “total_100G” is a measure created with the formula “(m_100G_more + f_100G_more) / total”, which gives the proportion of individuals in a state who made $100,000 or more. The set “max_high_income” contains the states with a proportion of 0.10 or higher. States in “max_high_income” had similar AADR values (ranging from 19,316 in Connecticut to 21,938 in Virginia), with the exception of D.C., which had a relatively high AADR (24,620). States outside the set had a wider range of AADR values: Mississippi had the highest AADR (27,595) and Hawaii had the lowest (17,222), and neither had a large proportion of high-income individuals. Our data suggest that states with a higher proportion of individuals making $100,000 or more have similar AADR, while states with a lower proportion span a wider range. That range could be explained by factors not included in our dataset and is worth exploring further with different census datasets.
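The Tableau measure and set described above have a direct R equivalent, sketched here with placeholder counts for three hypothetical states. The column names follow the formula quoted in the text; the data frame `income` is an assumption.

```r
# Placeholder income counts for three hypothetical states.
income <- data.frame(
  AreaName    = c("State A", "State B", "State C"),
  m_100G_more = c(80, 30, 90),
  f_100G_more = c(40, 20, 60),
  total       = c(1000, 1000, 1000)
)

# Same formula as the Tableau measure "total_100G".
income$total_100G <- (income$m_100G_more + income$f_100G_more) / income$total

# Membership in the set "max_high_income": proportion of 0.10 or higher.
income$max_high_income <- income$total_100G >= 0.10
print(income[, c("AreaName", "total_100G", "max_high_income")])
```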
Full barchart: note the gradually decreasing average AADR line (red), which shows the steady decline in the death rate over time.
Zoomed excerpt of the barchart. Note that heart disease and cancer are the leading causes of death across these states.